Boosted Noise Filters for Identifying Mislabeled Data
Abstract
In many practical classification problems, mislabeled data instances (i.e., class noise) exist in the acquired (training) data and often have a detrimental effect on the classification performance. Identifying such noisy instances and removing them from training data can significantly improve the trained classifiers. One such effective noise detector is the so-called ensemble filter, which predicts the instances misclassified by multiple learned classifiers as noise. This paper proposes a novel noise detection method that uses a boosting ensemble of the ensemble noise filters. Multiple ensemble noise filters are built sequentially, with each one working on weighted instances. The weighting scheme follows the general boosting idea and reduces the weights of those instances that are confidently predicted as noise in previous runs. This method essentially wraps an existing ensemble filter-based noise detector with a second layer of boosting ensemble. Our experimental results on a range of real datasets from the UCI repository show the superiority of the proposed boosted noise detectors.
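The two-layer idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the base learners (a weighted nearest-centroid rule and weighted k-NN), the weight update `exp(-confidence)`, the majority thresholds, and all function names (`ensemble_filter`, `boosted_filter`, and so on) are hypothetical choices made here for illustration.

```python
import math
from collections import defaultdict

def centroid_predict(train, w, x):
    """Weighted nearest-centroid rule; train is a list of (features, label)."""
    sums, total = {}, defaultdict(float)
    for (feat, lab), wi in zip(train, w):
        acc = sums.setdefault(lab, [0.0] * len(feat))
        for j, v in enumerate(feat):
            acc[j] += wi * v
        total[lab] += wi
    return min(sums, key=lambda lab: math.dist([a / total[lab] for a in sums[lab]], x))

def knn_predict(k):
    """Build a weighted k-NN rule: each neighbour votes with its instance weight."""
    def predict(train, w, x):
        nearest = sorted(zip(train, w), key=lambda t: math.dist(t[0][0], x))[:k]
        votes = defaultdict(float)
        for (_, lab), wi in nearest:
            votes[lab] += wi
        return max(votes, key=votes.get)
    return predict

def ensemble_filter(data, w, learners, n_folds=5):
    """Inner layer: per instance, the fraction of learners that misclassify it
    when it is held out in n-fold cross-validation (its noise confidence)."""
    n = len(data)
    conf = [0.0] * n
    for fold in range(n_folds):
        held = set(range(fold, n, n_folds))
        train = [data[i] for i in range(n) if i not in held]
        tw = [w[i] for i in range(n) if i not in held]
        for i in held:
            x, lab = data[i]
            wrong = sum(learn(train, tw, x) != lab for learn in learners)
            conf[i] = wrong / len(learners)
    return conf

def boosted_filter(data, learners, rounds=5, n_folds=5):
    """Outer layer: run the ensemble filter sequentially on weighted instances,
    shrinking the weight of instances confidently flagged as noise.
    The update rule and thresholds below are assumptions, not the paper's."""
    n = len(data)
    w = [1.0] * n
    total = [0.0] * n
    for _ in range(rounds):
        conf = ensemble_filter(data, w, learners, n_folds)
        for i in range(n):
            total[i] += conf[i]
            if conf[i] > 0.5:  # a majority of learners reject the given label
                w[i] *= math.exp(-conf[i])
    # Flag instances whose average noise confidence over all rounds is a majority.
    return [i for i in range(n) if total[i] / rounds > 0.5]

# Toy data: two well-separated grids of points, with two labels flipped on purpose.
cluster_a = [((i, j), 0) for i in range(4) for j in range(4)]
cluster_b = [((i + 20, j + 20), 1) for i in range(4) for j in range(4)]
data = cluster_a + cluster_b
data[5] = (data[5][0], 1)    # mislabeled point inside cluster A
data[20] = (data[20][0], 0)  # mislabeled point inside cluster B

learners = [centroid_predict, knn_predict(3), knn_predict(5)]
noisy = boosted_filter(data, learners)
print(noisy)
```

Down-weighting suspected noise means later rounds train on data dominated by instances believed to be clean, which is the motivation the abstract gives for wrapping the ensemble filter in a boosting layer.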
Similar resources
Identifying and Eliminating Mislabeled Training Instances
This paper presents a new approach to identifying and eliminating mislabeled training instances. The goal of this technique is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed t...
Identifying Mislabeled Training Data
This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate sin...
Eliminating Class Noise in Large Datasets
This paper presents a new approach for identifying and eliminating mislabeled instances in large or distributed datasets. We first partition a dataset into subsets, each of which is small enough to be processed by an induction algorithm at one time. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance Ik, two error count variables ...
Active cleaning of label noise
Mislabeled examples in the training data can severely affect the performance of supervised classifiers. In this paper, we present an approach to remove any mislabeled examples in the dataset by selecting suspicious examples as targets for inspection. We show that the large margin and soft margin principles used in support vector machines (SVM) have the characteristic of capturing the mislabeled...
A noise filtering method using neural networks - Soft Computing Techniques in Instrumentation, Measurement and Related Applications, 2003. SCIMA 20
During the data collecting and labeling process it is possible for noise to be introduced into a data set. As a result, the quality of the data set degrades and experiments and inferences derived from the data set become less reliable. In this paper we present an algorithm, called ANR (automatic noise reduction), as a filtering mechanism to identify and remove noisy data items whose classes h...